White Wine Quality Exploration by Liya

This report explores a dataset containing 11 input attributes and 1 output attribute for around 5000 instances of white wine. We will explore whether any of these attributes relate to wine quality, which is scored by human tasters, and if so, how.

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median : 5.200   Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Our dataset consists of 13 variables and 4898 observations.

Except for the first variable, X, which is the index of the records, we have 12 attributes of white wines for analysis.
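As a minimal sketch of that step, the snippet below rebuilds the first five rows by hand from the str() output above (values as printed there, so density is rounded) and drops the index column. The names `wine` and `wine_attrs` are assumptions for illustration; the original loading code is not shown in this report.

```r
# First five rows, typed in from the str() output above (values as printed).
wine <- data.frame(
  X = 1:5,
  fixed.acidity = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23),
  citric.acid = c(0.36, 0.34, 0.4, 0.32, 0.32),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides = c(0.045, 0.049, 0.05, 0.058, 0.058),
  free.sulfur.dioxide = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates = c(0.45, 0.49, 0.44, 0.4, 0.4),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# X only repeats the row number, so drop it before analysis.
wine_attrs <- wine[, names(wine) != "X"]

dim(wine_attrs)
str(wine_attrs)
```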

Univariate Plots Section

1 - quality (score between 0 and 10)

quality is the output attribute of the dataset. The scale runs from 0 to 10, but only scores from 3 to 9 appear in this dataset, mostly between 5 and 7; other scores are rare. The distribution is roughly normal.

quality is actually an ordinal discrete variable, so I transformed it to a factor to make later analysis easier.

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
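The conversion to a factor can be sketched as follows. The quality vector is rebuilt from the counts printed above; using an ordered factor is an assumption, since the report only says the variable was turned into a factor.

```r
# Rebuild the quality column from the counts printed above.
scores <- 3:9
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(scores, times = counts)

# Ordered factor: the levels keep their ranking (3 < 4 < ... < 9).
quality_f <- factor(quality, levels = scores, ordered = TRUE)

table(quality_f)

# Share of wines scored 5 to 7.
mean(quality %in% 5:7)
```

An ordered factor keeps comparisons such as `quality_f >= 7` meaningful, which a plain factor does not.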

2 - fixed acidity (tartaric acid - g / dm^3)

Most fixed acidity values lie between 4 and 10, and they are roughly normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

3 - volatile acidity (acetic acid - g / dm^3)

Right skewed, with most values below 0.5 and a peak around 0.25.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

4 - citric acid (g / dm^3)

Right skewed with a long tail, and a strange peak around 0.49. I zoomed in to check that area, but the cause of the peak is unclear.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

5 - residual sugar (g / dm^3)

Highly right skewed, with the largest concentration of values below 3 (the median is 5.2) and just a few outliers. The number of records decreases as residual sugar grows.

But when we transform the x-axis to a log scale, a bimodal distribution appears, with two peaks around 2 and 9 and a valley around 3. We’ll analyze later what causes that shape.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
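To illustrate why a log scale can expose two groups that are hard to separate on the raw scale, here is a small sketch on simulated data. The sample is not the wine data: it is a hypothetical mix of two log-normal groups centred near 2 and 9, chosen only to mimic the peaks described above.

```r
set.seed(1)

# Simulated stand-in for residual.sugar (NOT the wine data): two
# log-normal groups centred near 2 and 9 g/dm^3.
sugar <- c(rlnorm(500, meanlog = log(2), sdlog = 0.3),
           rlnorm(500, meanlog = log(9), sdlog = 0.4))

# Raw scale: the second group is smeared into a long right tail.
raw_hist <- hist(sugar, breaks = 30, plot = FALSE)

# log10 scale: both groups get comparable width, so each forms its own peak.
log_hist <- hist(log10(sugar), breaks = 30, plot = FALSE)

plot(log_hist, main = "Simulated residual sugar (log10 scale)",
     xlab = "log10(residual sugar)")
```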

6 - chlorides (sodium chloride - g / dm^3)

Right skewed, with most values below 0.1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

7 - free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3)

Both variables are right skewed with similar shapes (for free sulfur dioxide the mean, 35.31, is above the median, 34, and the maximum is 289). Since free sulfur dioxide is a part of total sulfur dioxide, we’re interested in the ratio.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

8 - free sulfur dioxide rate

We divide free sulfur dioxide by total sulfur dioxide to get a new variable, free sulfur dioxide rate.

The histogram shows it’s slightly right skewed with a few outliers; a ratio of around 30% is the most common.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0200  0.1900  0.2500  0.2556  0.3200  0.7100
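A minimal sketch of the ratio, using the first ten rows of each column as printed by str() at the top of the report. The short vector names `free` and `total` are assumptions for illustration.

```r
# First ten rows of each column, as printed by str() above.
free  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Share of the total sulfur dioxide that is free. Because free is a part
# of total, the ratio stays between 0 and 1.
free.sulfur.dioxide.rate <- free / total

round(free.sulfur.dioxide.rate, 2)
summary(free.sulfur.dioxide.rate)
```

Dividing in this direction (free over total) is what keeps the values in the 0 to 1 range shown in the summary above.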

9 - density (g / cm^3)

Right skewed with a very long tail. This is reasonable: since wines are mostly water, their density should be very close to 1 g/cm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

10 - pH

The distribution is roughly symmetric around 3.18 (the median).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

11 - sulphates (potassium sulphate - g / dm^3)

Right skewed, with a peak around 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

12 - alcohol (% by volume)

Right skewed. We can roughly recognize 3 parts; a log scale didn’t reveal anything more interesting.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Univariate Analysis

What is the structure of your dataset?

This dataset includes 4,898 instances of white wine. All 11 input attributes are continuous numerical variables, and the 1 discrete output attribute, quality, is a score from 0 to 10. Most attributes are skewed (mostly right skewed).

What is/are the main feature(s) of interest in your dataset?

I’m curious how quality, as scored by human tasters, is affected by each physical attribute, if at all.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There should be correlations between some of the attributes themselves; I’ll try to figure them out.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new variable representing the ratio of free to total sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When I applied a log scale to residual sugar, a bimodal shape appeared! There is also an unusual peak in citric acid. Both are interesting, and I wonder about the cause. I also transformed the quality variable from numerical to factor, since it is actually a discrete ordinal grading.

Bivariate Plots Section

13 - A Matrix of all attributes

The matrix provides an overview of the correlations between each pair of variables. Concerning quality, the only variable with a linear correlation of r > 0.3 is alcohol. The matrix also shows linear correlations (apart from several artifacts) between density and residual.sugar, chlorides and alcohol, between pH and fixed.acidity, and between fixed.acidity and free.sulfur.dioxide.rate.

##                          fixed.acidity volatile.acidity  citric.acid
## fixed.acidity               1.00000000      -0.02269729  0.289180698
## volatile.acidity           -0.02269729       1.00000000 -0.149471811
## citric.acid                 0.28918070      -0.14947181  1.000000000
## residual.sugar              0.08902070       0.06428606  0.094211624
## chlorides                   0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide        -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide        0.09106976       0.08926050  0.121130798
## density                     0.26533101       0.02711385  0.149502571
## pH                         -0.42585829      -0.03191537 -0.163748211
## sulphates                  -0.01714299      -0.03572815  0.062330940
## alcohol                    -0.12088112       0.06771794 -0.075728730
## quality                    -0.11366283      -0.19472297 -0.009209091
## free.sulfur.dioxide.rate   -0.13909280      -0.19553198  0.016383115
##                          residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity                0.08902070  0.02308564       -0.0493958591
## volatile.acidity             0.06428606  0.07051157       -0.0970119393
## citric.acid                  0.09421162  0.11436445        0.0940772210
## residual.sugar               1.00000000  0.08868454        0.2990983537
## chlorides                    0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide          0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide         0.40143931  0.19891030        0.6155009650
## density                      0.83896645  0.25721132        0.2942104109
## pH                          -0.19413345 -0.09043946       -0.0006177961
## sulphates                   -0.02666437  0.01676288        0.0592172458
## alcohol                     -0.45063122 -0.36018871       -0.2501039415
## quality                     -0.09757683 -0.20993441        0.0081580671
## free.sulfur.dioxide.rate     0.05196231 -0.03363087        0.7386123078
##                          total.sulfur.dioxide     density            pH
## fixed.acidity                     0.091069756  0.26533101 -0.4258582910
## volatile.acidity                  0.089260504  0.02711385 -0.0319153683
## citric.acid                       0.121130798  0.14950257 -0.1637482114
## residual.sugar                    0.401439311  0.83896645 -0.1941334540
## chlorides                         0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide               0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide              1.000000000  0.52988132  0.0023209718
## density                           0.529881324  1.00000000 -0.0935914935
## pH                                0.002320972 -0.09359149  1.0000000000
## sulphates                         0.134562367  0.07449315  0.1559514973
## alcohol                          -0.448892102 -0.78013762  0.1214320987
## quality                          -0.174737218 -0.30712331  0.0994272457
## free.sulfur.dioxide.rate         -0.012930593 -0.06535628  0.0004666430
##                            sulphates     alcohol      quality
## fixed.acidity            -0.01714299 -0.12088112 -0.113662831
## volatile.acidity         -0.03572815  0.06771794 -0.194722969
## citric.acid               0.06233094 -0.07572873 -0.009209091
## residual.sugar           -0.02666437 -0.45063122 -0.097576829
## chlorides                 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide       0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide      0.13456237 -0.44889210 -0.174737218
## density                   0.07449315 -0.78013762 -0.307123313
## pH                        0.15595150  0.12143210  0.099427246
## sulphates                 1.00000000 -0.01743277  0.053677877
## alcohol                  -0.01743277  1.00000000  0.435574715
## quality                   0.05367788  0.43557472  1.000000000
## free.sulfur.dioxide.rate -0.02261212  0.06489615  0.197649851
##                          free.sulfur.dioxide.rate
## fixed.acidity                        -0.139092802
## volatile.acidity                     -0.195531983
## citric.acid                           0.016383115
## residual.sugar                        0.051962312
## chlorides                            -0.033630873
## free.sulfur.dioxide                   0.738612308
## total.sulfur.dioxide                 -0.012930593
## density                              -0.065356278
## pH                                    0.000466643
## sulphates                            -0.022612119
## alcohol                               0.064896146
## quality                               0.197649851
## free.sulfur.dioxide.rate              1.000000000
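A table like the one above can be produced with base R's cor(). The sketch below uses only three columns and the first five rows printed by str() at the top of the report, so its numbers will not match the full matrix; it only shows the mechanics. The data frame name `wine_small` is an assumption.

```r
# Three columns, first five rows, as printed by str() above.
wine_small <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_small, method = "pearson")

round(cor_matrix, 3)
```

Even on this tiny sample the signs agree with the full matrix: residual.sugar and density move together, while density and alcohol move in opposite directions.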

14 - density

We can detect several linear correlations involving density; let’s plot them.

residual.sugar, chlorides and total.sulfur.dioxide increase as density increases, which is reasonable given that dissolved dry matter makes the wine denser, but alcohol goes the other way.

Noticing the similar shapes of the residual.sugar and alcohol plots, I plotted a 5th graph to see whether they are correlated too. The result supports a negative correlation. This is reasonable, since I learned on the internet that sugar is transformed into alcohol during fermentation, and alcohol is lighter than water, which brings down the overall density of the wine.

15 - fixed.acidity vs. pH

Scatter plots show their negative correlation. This is understandable, since higher acidity means a lower pH value, but we can’t see such a correlation between pH and the two other acidity attributes.

16 - Concerning the bimodal of residual.sugar

The bimodal shape implies that there could be different categories among the wines.

I found on the internet that wines are usually classified into the following categories according to their residual sugar:
- dry (< 4 g/dm^3)
- semi-dry (4 ~ 12 g/dm^3)
- semi-sweet (12 ~ 45 g/dm^3)
- sweet (> 45 g/dm^3)

Let’s create a new variable category by cutting the residual.sugar values.

##   residual.sugar   category
## 1           20.7 semi-sweet
## 2            1.6        dry
## 3            6.9   semi-dry
## 4            8.5   semi-dry
## 5            8.5   semi-dry
## 
##        dry   semi-dry semi-sweet      sweet 
##       2097       1975        825          0

Only one case was classified as ‘sweet’. I removed it because a single case carries no meaning for the following analysis, which is why the table above shows 0 for sweet.
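The categorisation can be sketched with base R's cut(), using the five residual.sugar values shown in the table above. Treating the intervals as closed on the left (so exactly 4 counts as semi-dry) is an assumption; the thresholds listed above do not say which side the boundaries belong to.

```r
# The five residual.sugar values printed in the table above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5)

# right = FALSE makes the intervals [0, 4), [4, 12), [12, 45), [45, Inf).
category <- cut(residual.sugar,
                breaks = c(0, 4, 12, 45, Inf),
                labels = c("dry", "semi-dry", "semi-sweet", "sweet"),
                right = FALSE)

data.frame(residual.sugar, category)
table(category)
```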

Evidently, dry and semi-dry wines are distributed on either side of the valley, and we can even notice a 3rd part for semi-sweet wines. That explains the bimodal shape.

17 - quality

As we can see in the matrix, no remarkable correlation is visible by eye, except perhaps for alcohol.

I ran a correlation test focused on quality (as a numerical value instead of a categorical one), and it supports the visual conclusion: only alcohol has an r value clearly greater than 0.3, along with density just past -0.3. Let’s plot them.

## # A tibble: 12 x 2
##    rowname                   quality
##    <chr>                       <dbl>
##  1 fixed.acidity            -0.114  
##  2 volatile.acidity         -0.195  
##  3 citric.acid              -0.00921
##  4 residual.sugar           -0.0976 
##  5 chlorides                -0.210  
##  6 free.sulfur.dioxide       0.00816
##  7 total.sulfur.dioxide     -0.175  
##  8 density                  -0.307  
##  9 pH                        0.0994 
## 10 sulphates                 0.0537 
## 11 alcohol                   0.436  
## 12 free.sulfur.dioxide.rate  0.198

As we can see, alcohol first decreases as the quality score rises, reaching a valley at a score of 5; after that, it grows steadily with the score.

We can also see that the response of quality to density is almost the reverse. This is reasonable, as we already know that density and alcohol are themselves correlated.

So which one really plays a role in affecting quality?

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We set off hoping to figure out how the 12 attributes affect the quality scores. But I actually found only one attribute, alcohol, that plays a clear role. So either other useful physical attributes are not included, or, we may guess, human tasting is not that reliable.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Dividing the wines into 4 categories based on their residual sugar explained the bimodal shape we detected in the univariate analysis. The relationship between residual sugar, alcohol and density is also interesting.

What was the strongest relationship you found?

Between residual.sugar and density, with an r value as high as 0.839.

Multivariate Plots Section

18 - Explore sulfur dioxide and sulphates

Since sulphates is described as “a wine additive which can contribute to sulfur dioxide gas (S02) levels”, we expect to see some relationship between them.

We start by creating a new variable, bound.sulfur.dioxide, by subtracting free.sulfur.dioxide from total.sulfur.dioxide.

That is disappointing: the plot didn’t reveal a linear correlation between them.

We made another plot that separates free and bound sulfur dioxide and takes the log of sulphates. It only shows a mildly positive relationship between bound sulfur dioxide and sulphates.
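The derived variable and the log transform described above can be sketched as follows, again on the first ten rows printed by str() at the top of the report. Using log10 is an assumption, since the base of the logarithm is not stated, and ten rows are far too few to reproduce the pattern seen in the full plots.

```r
# First ten rows of each column, as printed by str() above.
free      <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total     <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
sulphates <- c(0.45, 0.49, 0.44, 0.4, 0.4, 0.44, 0.47, 0.45, 0.49, 0.45)

# Bound sulfur dioxide is whatever part of the total is not free.
bound.sulfur.dioxide <- total - free

# Correlation of log-transformed sulphates with each part.
cor(log10(sulphates), bound.sulfur.dioxide)
cor(log10(sulphates), free)
```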

19 - sulfur dioxide and residual sugar

We can see that both free and bound sulfur dioxide increase with residual sugar (category). Since sulfur dioxide prevents microbial growth and the oxidation of wine, it seems reasonable that wines with more sugar need more sulfur dioxide.

This plot shows a tendency toward lower total sulfur dioxide with higher quality. The change is mostly contributed by bound sulfur dioxide; free sulfur dioxide stays almost unchanged across quality scores.

I found the following information regarding sulfites in wine:

  • Other factors that affect how much sulfite is needed are the residual sugar and the acidity of the wine. Dryer wines with more acid will tend to be lower in sulfites. Sweet wines and dessert wines, on the other hand, tend to be quite high in sulfites.

Let’s plot to see if that’s true for our data:

We do see sweeter wines gathering around the higher end of sulfur dioxide, but we can’t see them gathering around the more acidic end (lower pH). The reason is unknown.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We found that sulfur dioxide is also affected by residual sugar, which reinforces the relationship we found between residual sugar and quality.

Were there any interesting or surprising interactions between features?

I tried to rediscover the relationship between sulfur dioxide, acid and residual sugar. The data did show a correlation between sulfur dioxide and sugar, but not between acid and sugar.

Final Plots and Summary

Plot One

Description One

Residual sugar is not evenly distributed in this dataset; it gathers around three peaks that represent 3 categories: dry, semi-dry, and semi-sweet.

Plot Two

Description Two

The density of wines is usually under 1 g/cm^3 (the density of water), and it’s one of the outstanding features that correlate with a wine’s quality score. We can see that wines with high scores tend to have lower density (which also indicates lower residual sugar and higher alcohol). We can also notice a slightly bimodal density distribution among higher-scoring wines, which echoes the bimodal/trimodal pattern we revealed in Plot One.

Plot Three

Description Three

Wines with higher quality scores tend to contain less total sulfur dioxide. The differences are mostly contributed by bound sulfur dioxide, while free sulfur dioxide stays almost unchanged across scores.


Reflection

I set out on this analysis expecting to find factors that impact white wine quality. It turned out that a series of features that are correlated with each other, like residual sugar, alcohol, density and sulfur dioxide, influence the scores together. It’s hard to tell which one actually affects human taste most.

During the analysis, we did rediscover some industry experience and physical rules about wine. For example, the more sugar consumed during fermentation, the more alcohol is generated, and since alcohol is lighter than water, this decreases the wine’s density. But I failed to reveal the relationships among sugar, sulfur and acid.

The whole dataset contains around 5000 records, but only 1 case that can be classified as a ‘sweet’ wine. With more records on sweet wines, we might be able to detect more correlations that are not noticeable now. Comparing the white wine dataset with the red wine one might also help us find more interesting features regarding wines.